Temporal Action Localization (TAL) methods typically operate on top of feature sequences from a frozen snippet encoder that is pretrained on Trimmed Action Classification (TAC) tasks, resulting in a task discrepancy problem. While existing TAL methods mitigate this issue either by retraining the encoder with a pretext task or by end-to-end fine-tuning, they commonly demand heavy memory and computation. In this work, we introduce the Soft-Landing (SoLa) strategy, an efficient yet effective framework that bridges the transferability gap between the pretrained encoder and the downstream tasks by incorporating a lightweight neural network, i.e., a SoLa module, on top of the frozen encoder. We also propose an unsupervised training scheme for the SoLa module; it learns with inter-frame Similarity Matching, which uses the frame interval as its supervisory signal and eliminates the need for temporal annotations. Experimental evaluation on various benchmarks for downstream TAL tasks shows that our method effectively alleviates the task discrepancy problem with remarkable computational efficiency.
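As a rough illustration of the idea, here is a minimal PyTorch sketch of a lightweight adapter trained with an inter-frame similarity objective. The module architecture and the interval-to-similarity target below are assumptions for illustration, not the paper's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoLaModule(nn.Module):
    """Hypothetical lightweight adapter placed on top of frozen snippet features."""
    def __init__(self, dim=2048, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, dim, kernel_size=3, padding=1),
        )

    def forward(self, x):                      # x: (B, T, dim) frozen features
        return self.net(x.transpose(1, 2)).transpose(1, 2)

def similarity_matching_loss(feats, tau=0.1):
    """Assumed objective: cosine similarity between adapted features should
    decay with the frame interval, which acts as the supervisory signal."""
    z = F.normalize(feats, dim=-1)             # (B, T, D)
    sim = torch.einsum('btd,bsd->bts', z, z)   # (B, T, T) pairwise similarities
    T = feats.size(1)
    idx = torch.arange(T, device=feats.device)
    interval = (idx[:, None] - idx[None, :]).abs().float()
    target = torch.exp(-interval / (tau * T))  # assumed interval-to-similarity map
    return F.mse_loss(sim, target.expand_as(sim))
```

Because only the small adapter receives gradients, training touches a fraction of the parameters an end-to-end fine-tuning run would, which is where the claimed efficiency comes from.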
The key to video inpainting is to use as much relevant information from as many reference frames as possible. Existing flow-based propagation methods divide the video synthesis process into multiple steps: flow completion -> pixel propagation -> synthesis. However, a major drawback is that the errors of each step keep accumulating and amplifying in the next step. To this end, we propose an Error Compensation Framework for Flow-guided Video Inpainting (ECFVI), which takes advantage of the flow-based approach while offsetting its weaknesses. We address the weaknesses with a newly designed flow completion module and an error compensation network that exploits an error-guidance map. Our approach greatly improves the temporal consistency and visual quality of the completed videos. Experimental results show the superior performance of our proposed method over state-of-the-art methods, with a 6x speed-up. In addition, we present a new benchmark dataset for evaluation that supplements the weaknesses of existing test datasets.
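A highly simplified sketch of the two ingredients follows: a standard backward-warping utility for flow-guided propagation, and a residual corrector conditioned on an error map. The error-guidance input and the compensation architecture are assumptions, not the paper's actual networks:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp a reference frame with a (completed) optical flow field."""
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=frame.device),
                            torch.arange(W, device=frame.device), indexing='ij')
    x = 2 * (xs + flow[:, 0]) / (W - 1) - 1    # normalize coords to [-1, 1]
    y = 2 * (ys + flow[:, 1]) / (H - 1) - 1
    grid = torch.stack((x, y), dim=-1)         # (B, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

class ErrorCompensation(nn.Module):
    """Assumed residual corrector conditioned on an error-guidance map."""
    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(3 + 3 + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, propagated, coarse, error_map):
        x = torch.cat([propagated, coarse, error_map], dim=1)
        return coarse + self.refine(x)         # correct rather than re-synthesize
```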
We introduce a novel paradigm for offline Video Instance Segmentation (VIS), based on the hypothesis that explicit object-oriented information can be a strong clue for understanding the context of the whole sequence. To this end, we propose VITA, a simple structure built on top of off-the-shelf Transformer-based image instance segmentation models. Specifically, we use an image object detector as a means of distilling object-specific contexts into object tokens. VITA accomplishes video-level understanding by associating frame-level object tokens without using spatio-temporal backbone features. By effectively building relationships between objects using the condensed information, VITA achieves the state-of-the-art on VIS benchmarks with a ResNet-50 backbone: 49.8 AP and 45.7 AP on YouTube-VIS 2019 & 2021, and 19.6 AP on OVIS. Moreover, thanks to its object-token-based structure that is disjoint from the backbone features, VITA shows several practical advantages that previous offline VIS methods have not explored: handling long and high-resolution videos with a common GPU, and freezing a frame-level detector trained on the image domain. Code will be available at https://github.com/sukjunhwang/vita.
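The token-association idea can be sketched as a small Transformer decoder whose video-level queries attend over frame-level object tokens alone. The dimensions, query count, and layer choices below are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TokenAssociator(nn.Module):
    """Assumed sketch: learned video-level queries attend over frame-level
    object tokens, with no spatio-temporal backbone features involved."""
    def __init__(self, dim=256, num_video_queries=100, num_layers=3):
        super().__init__()
        self.video_queries = nn.Parameter(torch.randn(num_video_queries, dim))
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, frame_tokens):            # (B, T, N, D) object tokens
        B, T, N, D = frame_tokens.shape
        memory = frame_tokens.reshape(B, T * N, D)
        queries = self.video_queries.unsqueeze(0).expand(B, -1, -1)
        return self.decoder(queries, memory)    # (B, num_video_queries, D)
```

Since only T * N compact tokens per video enter the decoder, memory cost is decoupled from frame resolution, which is consistent with the long/high-resolution video advantage the abstract mentions.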
For online Video Instance Segmentation (VIS), fully utilizing the information from previous frames in an efficient manner is essential for real-time applications. Most previous methods follow a two-stage approach that requires additional computation such as RPN and RoIAlign, and does not fully exploit the available information in the video for all sub-tasks in VIS. In this paper, we propose a novel single-stage framework for online VIS built upon a grid-structured feature representation. The grid-based features allow us to employ fully convolutional networks for real-time processing, and also to easily reuse and share features within different components. We also introduce cooperatively operating modules that aggregate information from available frames in order to enrich the features for all sub-tasks in VIS. Our design fully takes advantage of previous information in a grid form for all tasks in an efficient way, and we achieve new state-of-the-art accuracy (38.6 AP and 36.9 AP) and speed (40.0 FPS) among online VIS methods on the YouTube-VIS 2019 and 2021 datasets.
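One plausible reading of the aggregation idea is a lightweight attention module that enriches the current frame's grid features with features kept from a previous frame. The module below is an assumption, not the paper's actual component:

```python
import torch
import torch.nn as nn

class GridMemoryAggregator(nn.Module):
    """Assumed sketch: fuse the current frame's grid features with stored
    previous-frame features via attention over flattened grid cells."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cur, prev):               # both (B, S*S, D) grid features
        fused, _ = self.attn(query=cur, key=prev, value=prev)
        return self.norm(cur + fused)           # residual keeps the grid layout
```

Because the output preserves the grid shape, the enriched features can be shared by every downstream sub-task head without region-level operations like RoIAlign.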
Generic Event Boundary Detection (GEBD) is a newly suggested video understanding task that aims to find one level deeper semantic boundaries of events. Bridging the gap between natural human perception and video understanding, it has various potential applications, including interpretable and semantically valid video parsing. Still at an early development stage, existing GEBD solvers are simple extensions of related video understanding tasks, disregarding GEBD's distinctive characteristics. In this paper, we propose a novel framework for unsupervised/supervised GEBD by using the Temporal Self-similarity Matrix (TSM) as the video representation. The new Recursive TSM Parsing (RTP) algorithm exploits local diagonal patterns in the TSM to detect boundaries, and it is combined with the Boundary Contrastive (BoCo) loss to train our encoder to generate more informative TSMs. Our framework can be applied to both unsupervised and supervised settings, achieving state-of-the-art performance by a huge margin on GEBD benchmarks. In particular, our unsupervised method outperforms the previous state-of-the-art "supervised" model, implying its exceptional efficacy.
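The TSM itself is just a pairwise cosine-similarity matrix over per-frame features. The boundary score below is a non-recursive simplification of the diagonal-pattern intuition; the window size k and the contrast formula are assumptions, not the RTP algorithm itself:

```python
import torch
import torch.nn.functional as F

def temporal_self_similarity(feats):
    """TSM: pairwise cosine similarity between per-frame features (T, D)."""
    z = F.normalize(feats, dim=-1)
    return z @ z.t()                            # (T, T)

def boundary_scores(tsm, k=5):
    """Simplified diagonal-pattern score: a boundary at t shows coherent
    similarity blocks before and after t but low similarity across them."""
    T = tsm.size(0)
    scores = torch.zeros(T)
    for t in range(k, T - k):
        before = tsm[t - k:t, t - k:t].mean()
        after = tsm[t:t + k, t:t + k].mean()
        cross = tsm[t - k:t, t:t + k].mean()
        scores[t] = (before + after) / 2 - cross
    return scores
```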
Scene text images have different shapes and are subjected to various distortions, e.g. perspective distortions. To handle these challenges, the state-of-the-art methods rely on a rectification network, which is connected to the text recognition network. They form a linear pipeline that applies text rectification to all input images, even those that can be recognized without it. Undoubtedly, the rectification network improves the overall text recognition performance. However, in some cases, the rectification network introduces unnecessary distortions, producing incorrect predictions on images that would otherwise have been recognized correctly. In order to alleviate the unnecessary distortions, the portmanteauing of features is proposed. The portmanteau feature, inspired by the portmanteau word, is a feature containing information from both the original text image and the rectified image. To generate the portmanteau feature, a non-linear input pipeline with a block matrix initialization is presented. In this work, the transformer is chosen as the recognition network due to its utilization of attention and inherent parallelism, which can effectively handle the portmanteau feature. The proposed method is examined on 6 benchmarks and compared with 13 state-of-the-art methods. The experimental results show that the proposed method outperforms the state-of-the-art methods on several of the benchmarks.
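A minimal sketch of what a portmanteau feature with block-matrix initialization might look like follows. The exact fusion layer and the 0.5 scaling are assumptions; the point is that the fused feature starts as an unbiased combination of both branches:

```python
import torch
import torch.nn as nn

class PortmanteauEmbedding(nn.Module):
    """Assumed sketch: fuse features of the original and rectified images.
    The projection starts as a block matrix [0.5*I | 0.5*I], so training
    begins from a plain average of the two branches."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(2 * dim, dim, bias=False)
        with torch.no_grad():
            eye = torch.eye(dim)
            self.proj.weight.copy_(0.5 * torch.cat([eye, eye], dim=1))

    def forward(self, feat_orig, feat_rect):    # each (B, L, dim)
        return self.proj(torch.cat([feat_orig, feat_rect], dim=-1))
```

Starting from the block-matrix initialization, the network can learn to down-weight the rectified branch on images where rectification hurts, which is the failure mode the abstract describes.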
In this paper, we propose to leverage a unique characteristic of dialogues, namely the commonsense knowledge shared among participants, to resolve the difficulties in summarizing them. We present SICK, a framework that uses commonsense inferences as additional context. Compared to previous work that relies solely on the input dialogue, SICK uses an external knowledge model to generate a rich set of commonsense inferences and selects the most probable one with a similarity-based selection method. Built upon SICK, SICK++ utilizes commonsense as supervision, adding the task of generating commonsense inferences as an auxiliary task in a dialogue-summarization multi-task learning setting. Experimental results show that, with injected commonsense knowledge, our framework generates more informative and consistent summaries than existing methods.
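The similarity-based selection step can be sketched independently of the knowledge model. The encoder producing the embeddings is left abstract here, and the tensors below are placeholders standing in for any sentence encoder's output:

```python
import torch
import torch.nn.functional as F

def select_inference(context_emb, candidate_embs):
    """Assumed selection rule: keep the commonsense inference whose embedding
    is closest (cosine) to the embedding of the input dialogue."""
    sims = F.cosine_similarity(context_emb.unsqueeze(0), candidate_embs, dim=-1)
    return sims.argmax().item()

# Placeholder tensors standing in for encoder outputs:
context_emb = torch.randn(768)        # embedding of the dialogue
candidate_embs = torch.randn(5, 768)  # embeddings of 5 generated inferences
best = select_inference(context_emb, candidate_embs)
```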
A deep learning strategy is developed for fast and accurate gas property measurements using flame emission spectroscopy (FES). In particular, short-gated fast FES is essential to resolve fast-evolving combustion behaviors. However, as the exposure time for capturing the flame emission spectrum gets shorter, the signal-to-noise ratio (SNR) decreases, and the characteristic spectral features indicating the gas properties become relatively weaker, making property estimation from the short-gated spectrum difficult and inaccurate. Denoising convolutional neural networks (CNN) can enhance the SNR of the short-gated spectrum. A new CNN architecture including a reversible down- and up-sampling (DU) operator and a loss function based on proper orthogonal decomposition (POD) coefficients is proposed. For training and testing the CNN, flame chemiluminescence spectra were captured from a stable methane-air flat flame using a portable spectrometer (spectral range: 250 - 850 nm, resolution: 0.5 nm) with varied equivalence ratio (0.8 - 1.2), pressure (1 - 10 bar), and exposure time (0.05, 0.2, 0.4, and 2 s). The long-exposure (2 s) spectra were used as the ground truth when training the denoising CNN. A kriging model with POD is trained on the long-gated spectra for calibration, and the gas properties are then predicted from the denoised short-gated spectrum: the prediction errors for pressure and equivalence ratio were remarkably lowered despite the low SNR attendant with reduced exposure.
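A sketch of a POD-coefficient loss under standard assumptions follows: POD modes are obtained from an SVD of the mean-subtracted training spectra, and the network is penalized on the projected coefficients. The number of modes and the plain MSE weighting are assumptions:

```python
import torch

def pod_basis(spectra, n_modes=20):
    """POD modes via SVD of mean-subtracted training spectra (N, L)."""
    mean = spectra.mean(dim=0, keepdim=True)
    _, _, Vh = torch.linalg.svd(spectra - mean, full_matrices=False)
    return mean, Vh[:n_modes]                   # (1, L), (n_modes, L)

def pod_loss(denoised, target, mean, basis):
    """Assumed loss on POD coefficients: penalize the network in the
    low-dimensional space that carries the property-relevant features."""
    a_hat = (denoised - mean) @ basis.t()
    a_ref = (target - mean) @ basis.t()
    return torch.mean((a_hat - a_ref) ** 2)
```

Measuring the error in POD space rather than per wavelength focuses the denoiser on the few modes the downstream kriging model actually uses for property prediction.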
Existing pruning techniques preserve deep neural networks' overall ability to make correct predictions but may also amplify hidden biases during the compression process. We propose a novel pruning method, Fairness-aware GRAdient Pruning mEthod (FairGRAPE), that minimizes the disproportionate impacts of pruning on different sub-groups. Our method calculates the per-group importance of each model weight and selects a subset of weights that maintains the relative between-group total importance in pruning. The proposed method then prunes network edges with small importance values and repeats the procedure by updating the importance values. We demonstrate the effectiveness of our method on four different datasets (FairFace, UTKFace, CelebA, and ImageNet) for the task of face attribute classification, where our method reduces the disparity in performance degradation by up to 90% compared to the state-of-the-art pruning algorithms. Our method is substantially more effective in settings with a high pruning rate (99%). The code and dataset used in the experiments are available at https://github.com/bernardo1998/fairgrape
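A single round of the group-aware selection might look like the sketch below; the proportional-quota rule is a simplification of the paper's iterative procedure with importance updates, and all names and the |weight * grad| importance score are assumptions for illustration:

```python
import torch

def fairness_aware_mask(weight, group_grads, keep_ratio=0.1):
    """One round of assumed group-aware selection: score every weight per
    group with |weight * grad|, then keep weights group by group so each
    group's share of the kept set tracks its total importance."""
    flat_w = weight.flatten()
    imps = torch.stack([(flat_w * g.flatten()).abs() for g in group_grads])
    totals = imps.sum(dim=1)                            # (G,) per-group totals
    n_keep = int(flat_w.numel() * keep_ratio)
    quotas = (totals / totals.sum() * n_keep).long()    # per-group budgets
    mask = torch.zeros_like(flat_w, dtype=torch.bool)
    for g in range(imps.size(0)):
        scores = imps[g].masked_fill(mask, float('-inf'))  # skip kept weights
        mask[scores.topk(int(quotas[g])).indices] = True
    return mask.view_as(weight)
```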
The object grounding task aims to locate a target object in an image through verbal communication. Understanding human commands is an important process for effective human-robot communication. However, this is challenging because human commands can be ambiguous and erroneous. This paper aims to disambiguate human referring expressions by allowing the agent to ask relevant questions based on semantic data obtained from scene graphs. We test whether our agent can use relations between objects from a scene graph to ask semantically relevant questions that disambiguate the original user command. In this paper, we present Incremental Grounding using Scene Graphs (IGSG), a disambiguation model that uses semantic data from an image scene graph and a language scene graph to ground objects based on human commands. Compared to the baseline, IGSG shows promising results in complex real-world scenes where there are multiple identical target objects. IGSG can effectively disambiguate ambiguous or erroneous expressions by asking disambiguation questions back to the user.
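A toy sketch of relation-based disambiguation over a scene graph follows; the graph format and the question template are invented for illustration only:

```python
# Hypothetical scene-graph format: objects plus (subject, relation, object) edges.
scene_graph = {
    "objects": [
        {"id": 0, "label": "cup"},
        {"id": 1, "label": "cup"},
        {"id": 2, "label": "table"},
        {"id": 3, "label": "shelf"},
    ],
    "relations": [(0, "on", 2), (1, "on", 3)],
}

def disambiguation_question(graph, target_label):
    """If the command matches several objects, use a relation that differs
    between the candidates to ask the user a clarifying question."""
    matches = [o["id"] for o in graph["objects"] if o["label"] == target_label]
    if len(matches) <= 1:
        return None                      # the command is already unambiguous
    for subj, rel, obj in graph["relations"]:
        if subj in matches:
            anchor = graph["objects"][obj]["label"]
            return f"Do you mean the {target_label} {rel} the {anchor}?"

print(disambiguation_question(scene_graph, "cup"))
# -> Do you mean the cup on the table?
```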